Meeting timing requirements in speed-critical designs has always been a challenge. In response, designers of FPGA-based speed-critical systems have developed some standard techniques to achieve timing closure: the point at which the actual place-and-route results for the design meet the timing requirements of the application.
Because FPGAs use a larger logic building block than ASICs, understanding how your coding style will “map” into the logic building blocks of your target FPGA is crucial. Reducing the number of logic levels a signal must pass through between registers results in a faster implementation.
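As a rough illustration of why fan-in matters, the following sketch counts the levels needed to combine a wide set of signals using a plain tree of identical logic blocks. The function name and the 4-input block size are assumptions chosen for illustration; real logic blocks and synthesis mappings vary by vendor.

```python
def levels_for_wide_gate(num_inputs: int, block_fan_in: int) -> int:
    """Count the levels of logic blocks needed to combine num_inputs signals.

    Assumes a plain tree of identical blocks, each accepting up to
    block_fan_in inputs (a simplification of any real FPGA architecture).
    """
    if num_inputs < 1 or block_fan_in < 2:
        raise ValueError("need at least 1 input and a fan-in of at least 2")
    levels = 0
    signals = num_inputs
    while signals > 1:
        # Each level merges groups of up to block_fan_in signals into one.
        signals = -(-signals // block_fan_in)  # ceiling division
        levels += 1
    return levels


if __name__ == "__main__":
    # Hypothetical 4-input blocks: widening a function adds whole levels.
    for width in (4, 16, 17, 64):
        print(width, "inputs ->", levels_for_wide_gate(width, 4), "levels")
```

Under these assumptions, a 16-input function fits in two levels, but adding a 17th input forces a third. That is the kind of step a coding style can cause or avoid.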
For example, a simple approach to a complex state-machine design would be to create one large state machine, place and route the design, and then optimize the critical paths to reduce delay. An alternative coding style would be to break the large, complex state machine into several smaller machines of 8 to 12 states each, which intercommunicate to implement the same task. Doing so reduces the number of logic levels required. Although it may require more logic modules to implement, it solves timing problems at the front end of the design process and saves time later. The accompanying figure illustrates a simple example, taken from a recent design, of the impact coding style has on the number of levels of logic required for a state machine. Most synthesis tools also have settings that allow state machines to be implemented with a “one-hot” encoding (in which each state is represented by a separate register bit), making it easy for fan-in-limited FPGA logic blocks to decode state combinations.
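The difference between a binary and a one-hot state encoding can be sketched in a few lines. The helper names below are hypothetical, and the decode counts ignore any don't-care optimization a synthesis tool might apply to unused binary codes.

```python
def binary_width(num_states: int) -> int:
    """Register bits needed to binary-encode num_states states."""
    if num_states < 1:
        raise ValueError("need at least one state")
    return max(1, (num_states - 1).bit_length())


def binary_encode(state: int, num_states: int) -> str:
    """Binary code for a state, most significant bit first."""
    if not 0 <= state < num_states:
        raise ValueError("state out of range")
    return format(state, f"0{binary_width(num_states)}b")


def one_hot_encode(state: int, num_states: int) -> str:
    """One-hot code: exactly one register bit is set per state."""
    if not 0 <= state < num_states:
        raise ValueError("state out of range")
    return format(1 << state, f"0{num_states}b")


def decode_inputs(num_states: int, one_hot: bool) -> int:
    """Register bits a decoder must examine to recognize a single state."""
    return 1 if one_hot else binary_width(num_states)


if __name__ == "__main__":
    n = 10  # hypothetical 10-state machine
    print(binary_encode(5, n), decode_inputs(n, one_hot=False))
    print(one_hot_encode(5, n), decode_inputs(n, one_hot=True))
```

One-hot encoding spends more registers (10 instead of 4 in this example) so that each state can be recognized from a single bit, which suits logic blocks with limited fan-in.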
In some cases the levels of logic required to implement your design are not the problem. It may be that a high fan-out signal, such as the enable for a wide register bank on your PCI input data bus, is too slow. A useful technique, now supported in most synthesis tools, allows designers to limit the fan-out of an important signal to reduce its delay. This is accomplished by duplicating the logic that generates the signal, so that each copy drives fewer loads and the routing delay is reduced without adding levels of logic to buffer the signal. Logic duplication requires more logic modules and can increase the load on the signals upstream, but in almost every case the additional delay is a small fraction of the savings achieved.
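The bookkeeping behind fan-out limiting can be sketched as follows. This is only a model of the idea, not how any particular synthesis tool implements it; the function name, the load names, and the limit of 16 are assumptions for illustration.

```python
def replicate_driver(loads: list[str], max_fanout: int) -> list[list[str]]:
    """Split the loads of one signal among duplicated drivers.

    Returns one list of loads per driver copy, each no longer than
    max_fanout. The number of copies is also the extra load placed on
    the signals upstream of the duplicated logic.
    """
    if max_fanout < 1:
        raise ValueError("max_fanout must be at least 1")
    return [loads[i:i + max_fanout] for i in range(0, len(loads), max_fanout)]


if __name__ == "__main__":
    # Hypothetical 64-bit register bank sharing a single enable signal.
    enable_loads = [f"data_reg[{i}]" for i in range(64)]
    copies = replicate_driver(enable_loads, max_fanout=16)
    print(len(copies), "driver copies, largest fan-out",
          max(len(group) for group in copies))
```

Here a single driver with 64 loads becomes four copies with 16 loads each, while whatever feeds the enable logic now sees four loads instead of one.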
Also, selecting the right memory architecture up front can save you from timing problems late in the design cycle. Use the dedicated memory blocks where the flow of information into and out of the memory is structured and confined to a limited region of the design; this avoids signal congestion near the memory. Synchronous memory implementations are also preferred over asynchronous ones, because the path through the memory then becomes independent of the routing delays to and from the memory.
Another important technique for achieving successful timing closure on an aggressive design is to carefully review the most critical timing constraints. In some cases, seemingly critical timing constraints may not be critical at all. If they can be removed or ignored, you may be well on the way to an early design finish. The simple approach of requiring every internal signal in a synchronous design to meet clock-to-clock timing may be easy to specify, but it will usually overconstrain your design.
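One common reason a constraint turns out not to be critical is that the destination register samples the result only every few clock cycles. The sketch below shows how slack changes when such a path is allowed more than one cycle. It is deliberately simplified: setup time and clock skew are ignored, and the 10-ns period and 14-ns path delay are invented values.

```python
def path_slack_ns(clock_period_ns: float, path_delay_ns: float,
                  cycles_allowed: int = 1) -> float:
    """Slack for a register-to-register path (negative means a violation).

    Simplified model: setup time and clock skew are ignored, and the path
    is assumed to have cycles_allowed full clock periods to settle.
    """
    if cycles_allowed < 1:
        raise ValueError("a path needs at least one clock cycle")
    return cycles_allowed * clock_period_ns - path_delay_ns


if __name__ == "__main__":
    period, delay = 10.0, 14.0  # hypothetical 100-MHz clock and slow path
    print("single-cycle slack:", path_slack_ns(period, delay))
    print("two-cycle slack:   ", path_slack_ns(period, delay, cycles_allowed=2))
```

Treated as a single-cycle path, this example fails by 4 ns; if the design genuinely allows the result two cycles to settle, the same path passes with 6 ns to spare and needs no optimization effort.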
Designers may find that the timing challenge is not just within the FPGA but also involves timing between the FPGA and other chips on their board. A signal that requires special consideration is a clock signal going onto or off the FPGA. FPGAs that have delay-locked loops (DLLs) on-chip make it easier to control timing on these critical signals.
When a clock is being driven from the FPGA output to a critical off-chip device, the DLL can be used to eliminate the clock delay from the FPGA I/O to the external device. This provides the maximum amount of time for a registered data signal to travel from the FPGA to the external device, since the clock edges are synchronized by the DLL.
If a signal from the middle of the clock trace is brought back into the FPGA and used to synchronize the I/O register clock, the trace delay on the PC board can be eliminated. This technique can be used for any high-speed signal where the relationship between an on-chip and an off-chip signal needs to be controlled.
When a clock is being driven into the FPGA, the DLL can be used to eliminate the delay typically associated with the input buffers and clock routing inside the FPGA. The DLL can synchronize the clock signal on the input pin with the clock signal at the associated input data registers to ensure that the maximum chip-to-chip timing budget is available.
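The effect of removing clock delay on a chip-to-chip path can be sketched as a simple timing budget. This is a conservative textbook-style model, not a vendor formula: any clock delay the DLL does not compensate is treated as worst-case uncertainty that subtracts from the budget, and every number below is invented for illustration.

```python
def chip_to_chip_margin_ns(clock_period_ns: float,
                           clock_to_out_ns: float,
                           board_trace_ns: float,
                           setup_ns: float,
                           uncompensated_clock_delay_ns: float = 0.0) -> float:
    """Setup margin for a registered signal crossing between two chips.

    Conservative model: uncompensated clock delay is counted as
    worst-case skew that eats into the available cycle time.
    """
    return (clock_period_ns
            - clock_to_out_ns
            - board_trace_ns
            - setup_ns
            - uncompensated_clock_delay_ns)


if __name__ == "__main__":
    # Hypothetical numbers: 10-ns period, 3-ns clock-to-out,
    # 1.5-ns board trace, 2-ns setup at the receiving register.
    budget = dict(clock_period_ns=10.0, clock_to_out_ns=3.0,
                  board_trace_ns=1.5, setup_ns=2.0)
    print("without DLL:",
          chip_to_chip_margin_ns(uncompensated_clock_delay_ns=2.5, **budget))
    print("with DLL:   ", chip_to_chip_margin_ns(**budget))
```

In this made-up budget, compensating a 2.5-ns clock delay raises the margin from 1.0 ns to 3.5 ns without touching the data path.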
Without a DLL, interchip timing is more difficult to control, which can make timing closure harder to achieve. Most FPGA manufacturers have detailed application notes on using their DLLs, and these are “must-read” documents for any high-speed-FPGA designer.